The dataset I chose to analyze is the Prosper Loan Data. Given my own financial history, I am particularly interested in the story the data tells about loans.

Subset the Data

library(dplyr)      # select(), filter()
library(lubridate)  # ymd_hms()

loans <- select(loanData, CreditGrade, Term, BorrowerAPR, BorrowerRate, 
                ProsperRating..Alpha., ListingCategory..numeric., 
                IsBorrowerHomeowner, CreditScoreRangeLower, 
                CreditScoreRangeUpper, FirstRecordedCreditLine, 
                OpenCreditLines, TotalCreditLinespast7years, 
                OpenRevolvingAccounts, OpenRevolvingMonthlyPayment, 
                InquiriesLast6Months, DelinquenciesLast7Years, 
                RevolvingCreditBalance, BankcardUtilization, 
                AvailableBankcardCredit, DebtToIncomeRatio, LoanNumber, 
                LoanOriginalAmount, MonthlyLoanPayment, Investors, 
                LoanOriginationDate, DateCreditPulled, StatedMonthlyIncome)

str(loans)
## 'data.frame':    113937 obs. of  27 variables:
##  $ CreditGrade                : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
##  $ Term                       : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ BorrowerAPR                : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate               : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ ProsperRating..Alpha.      : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
##  $ ListingCategory..numeric.  : int  0 2 0 16 2 1 1 2 7 7 ...
##  $ IsBorrowerHomeowner        : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ CreditScoreRangeLower      : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper      : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ FirstRecordedCreditLine    : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
##  $ OpenCreditLines            : int  4 14 NA 5 19 17 7 6 16 16 ...
##  $ TotalCreditLinespast7years : int  12 29 3 29 49 49 20 10 32 32 ...
##  $ OpenRevolvingAccounts      : int  1 13 0 7 6 13 6 5 12 12 ...
##  $ OpenRevolvingMonthlyPayment: num  24 389 0 115 220 1410 214 101 219 219 ...
##  $ InquiriesLast6Months       : int  3 3 0 0 1 0 0 3 1 1 ...
##  $ DelinquenciesLast7Years    : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ RevolvingCreditBalance     : num  0 3989 NA 1444 6193 ...
##  $ BankcardUtilization        : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ AvailableBankcardCredit    : num  1500 10266 NA 30754 695 ...
##  $ DebtToIncomeRatio          : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ LoanNumber                 : int  19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
##  $ LoanOriginalAmount         : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ MonthlyLoanPayment         : num  330 319 123 321 564 ...
##  $ Investors                  : int  258 1 41 158 20 1 1 1 1 1 ...
##  $ LoanOriginationDate        : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
##  $ DateCreditPulled           : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
##  $ StatedMonthlyIncome        : num  3083 6125 2083 2875 9583 ...

Check Credit Rating Levels

ProsperRating..Alpha.

   ""     A    AA     B     C     D     E    HR
29084 14551  5372 15581 18345 14274  9795  6935

CreditGrade

   ""     A    AA     B     C     D     E    HR    NC
84984  3315  3509  4389  5649  5153  3289  3508   141

Combine and Clean the Data

#Combine Credit Rating Systems
loans$ProsperCreditGrade <- 
  as.factor(ifelse(loans$ProsperRating..Alpha. == "", 
                   as.character(loans$CreditGrade), 
                   as.character(loans$ProsperRating..Alpha.)))

#Analyze and Remove "No Credit" Loans

loans <- filter(loans, ProsperCreditGrade != "NC")
loans <- filter(loans, ProsperCreditGrade != "")

#Add Credit Score System variable
#2013-09-07 14:08:48 - Last before transition
#2013-09-08 00:23:19 - First after transition
loans$DateCreditPulled <- ymd_hms(loans$DateCreditPulled)
#Parse the cutoff with ymd_hms() too, so both sides are UTC date-times
loans$CreditSystem <- 
  ifelse(loans$DateCreditPulled < ymd_hms("2013-09-07 14:08:59"), 
         "ScoreX", "FICO08")
loans$CreditSystem <- 
  factor(loans$CreditSystem, levels = c("ScoreX", "FICO08"))

#Add Prosper Rating System variable
loans$ProsperRatingSystem <- 
  (ifelse(loans$DateCreditPulled < ymd_hms("2009-06-01 00:00:00"), 
          "Old Rating System", "New Rating System"))
loans$ProsperRatingSystem <- 
  factor(loans$ProsperRatingSystem, 
         levels = c("Old Rating System", "New Rating System"))

#Add Stated Income Bucket Variable
loans$statedIncomeBucket <- 
  cut(loans$StatedMonthlyIncome, c(0, 4750, 483300), dig.lab = 6)

#Clean Up Data Frame
loans <- select(loans, -CreditGrade, -ProsperRating..Alpha.)
levels(loans$ProsperCreditGrade)
## [1] ""   "A"  "AA" "B"  "C"  "D"  "E"  "HR" "NC"
loans$ProsperCreditGrade <- 
  factor(loans$ProsperCreditGrade, 
         levels = c("AA", "A", "B", "C", "D", "E", "HR"))
levels(loans$ProsperCreditGrade)
## [1] "AA" "A"  "B"  "C"  "D"  "E"  "HR"

#Create Average Credit Score
loans$avgCredit <- 
  (loans$CreditScoreRangeLower + loans$CreditScoreRangeUpper) / 2

Just over half of the borrowers are homeowners, and the mean credit score for the group is approximately 696. The median original loan amount is $6,500 at a mean rate of 19.28%.


     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA’s
      9.5    669.5    689.5    695.8    729.5    889.5      580

The plot is consistent with the median Average Credit Score of 689.5.


     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA’s
  0.00653   0.1563   0.2098   0.2188   0.2839   0.5123       24

I would expect a roughly normal distribution here, and that is broadly what we see, but the spike around 0.35 (35%) indicates that this particular rate was more common than the rest of the distribution would suggest. I wonder if this is due to a special rate offered during one particular period. I will check this against other variables to see if I can find a trend.


     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA’s
        0        6        9    9.261       12       54     7463

     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA’s
        2       17       25    26.77       35      136      629

Correlation Coefficient of OpenCreditLines and TotalCreditLinespast7years: 0.5868

Total Credit Lines Past 7 Years Summary among Median Open Credit Line Borrowers:

     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
        9       19       25    26.56       33       85

The distributions of current open credit lines and total credit lines in the past 7 years are very similar, suggesting patterns of behavior among borrowers. However, with a correlation coefficient of 0.59, the relationship between the two is not as strong as the distributions indicate. The scatter plot shows a clear relationship, as expected given that open credit lines are included in the total credit lines statistic, but there is quite a bit of variability among borrowers with the median number of open credit lines. Total credit lines in the last 7 years among these borrowers range from 9 to 85!
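
As a rough sketch of how the correlation and the conditional summary above might be computed: the `toy` data frame and its values are made up for illustration and are not the Prosper data. Because both columns contain missing values, `use = "complete.obs"` is assumed here so that incomplete rows are dropped before correlating.

```r
# Hypothetical toy data standing in for the two credit-line columns.
toy <- data.frame(
  OpenCreditLines            = c(4, 9, NA, 9, 12, 9, 6),
  TotalCreditLinespast7years = c(12, 25, 3, 40, 30, 18, NA)
)

# Pearson correlation, ignoring rows where either value is missing.
r <- cor(toy$OpenCreditLines, toy$TotalCreditLinespast7years,
         use = "complete.obs")

# Keep only borrowers sitting exactly at the median number of open lines,
# then summarise their total credit lines.
med_open  <- median(toy$OpenCreditLines, na.rm = TRUE)
at_median <- subset(toy, OpenCreditLines == med_open)
summary(at_median$TotalCreditLinespast7years)
```

Even in this tiny example, borrowers with the same number of open lines can have very different totals, which is the kind of spread the scatter plot shows.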


It was interesting to see that Total Credit Lines, Open Credit Lines and Open Revolving Accounts all have slightly skewed distributions. The similarity between the distributions of the Open Accounts is no surprise, as one is a subset of the other, but I found it compelling that the distributions of current and historical open accounts were relatively similar.


     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA’s
        0        0        1    1.435        2      105      629

The majority of borrowers have two or fewer credit inquiries in the last 6 months. Shockingly, one borrower had over 100!


Even on a modified scale, the frequency of borrowers with no delinquencies appears vastly higher than borrowers with one or more. It was surprising to see that borrowers with more than 50 delinquencies in the last seven years were approved for a loan.

Delinquencies Summary:

     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA’s
        0        0        0    4.159        3       99      919

BankCardUtilization Summary:

     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA’s
        0     0.31      0.6   0.5614     0.84     5.95     7463

Given what we saw with delinquencies, the high frequency of borrowers with 0% BankCard utilization comes as less of a surprise. What is counter-intuitive is the distribution above 0% utilization. It would appear that borrowers with any credit card debt are more likely to be utilizing more of their available credit than less. I wonder how this would look if we excluded those borrowers who received debt consolidation loans.

While both subsets have peak non-zero utilization close to 100%, borrowers with non-zero credit utilization who did not receive a debt consolidation loan appear about as likely to have low utilization as high utilization.
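
The split described above could be sketched roughly as follows. This is only an illustration on made-up values. It assumes that a ListingCategory..numeric. value of 1 denotes a debt consolidation loan (my reading of Prosper's data dictionary, not something shown in this report), and the 50% cut point between "low" and "high" utilization is an arbitrary choice for the example.

```r
# Hypothetical toy data; not the Prosper data.
toy <- data.frame(
  ListingCategory..numeric. = c(1, 1, 2, 7, 1, 0, 3, 1),
  BankcardUtilization       = c(0.95, 0.80, 0.10, 0.00, 0.99, 0.45, 0.90, NA)
)

# ASSUMPTION: category 1 is "Debt Consolidation".
toy$DebtConsolidation <- toy$ListingCategory..numeric. == 1

# Keep borrowers with known, non-zero utilization.
nonzero <- subset(toy, !is.na(BankcardUtilization) & BankcardUtilization > 0)

# Band utilization (illustrative 50% threshold; Inf covers values above 100%).
nonzero$UtilBand <- cut(nonzero$BankcardUtilization,
                        breaks = c(0, 0.5, Inf),
                        labels = c("low", "high"))

# Row-wise shares: within each loan type, what fraction is low vs. high?
shares <- prop.table(table(nonzero$DebtConsolidation, nonzero$UtilBand),
                     margin = 1)
shares
```

Comparing the two rows of `shares` is a numeric counterpart to comparing the two histograms by eye.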


Debt to Income Ratio Summary:

     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA’s
        0     0.14     0.22   0.2761     0.32    10.01     8482

Debt to Income Ratio Distribution by Home Owner Status:

Non-homeowners appear to have a noticeably lower debt to income ratio than homeowners, which makes sense given a mortgage’s impact on this statistic.


LoanOriginalAmount Summary:

     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     1000     4000     6500     8349    12000    35000

MonthlyLoanPayment Summary:

     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
        0    131.8    218.3    272.9    371.6     2252

As I expected, the distribution of monthly loan payments is quite similar to the one for original loan amount. Any differences are likely due to variation in interest rates and loan terms.


Investors Summary:

     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
        1        2       44    80.45      115     1189

Number of One Investor Loans: 27812

The distribution of Investors per loan is quite odd. While each loan has an average of 80 investors, nearly one quarter of the loans had only one investor. This explains the median of 44 investors being so much lower than the mean.
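
To make the "nearly one quarter" figure concrete, here is the arithmetic using the counts reported in this analysis (27,812 one-investor loans out of the 113,665 loans left after removing unrated ones), followed by a small made-up vector showing how a few very large values pull the mean above the median. The `investors` vector is purely illustrative.

```r
# Counts taken from this report.
one_investor_loans <- 27812
total_loans        <- 113665

# Share of loans funded by a single investor, as a percentage.
share_pct <- round(100 * one_investor_loans / total_loans, 2)
share_pct

# Hypothetical investor counts: many ones plus a long right tail.
investors <- c(1, 1, 1, 40, 60, 120, 900)
c(median = median(investors), mean = mean(investors))
```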


Univariate Analysis

Structure of the Dataset

The original Prosper Loan Dataset of 113,937 observations was refined, after removing loans with no credit information, to 113,665 observations. For the analysis, I narrowed the 81 variables to 27 that were related to basic loan and credit information.

The only ordered factor variables were CreditGrade and ProsperRating..Alpha., which were combined to create the variable ProsperCreditGrade with the following levels: AA, A, B, C, D, E, HR.

Additionally:

The mean credit score was approximately 696. The median original loan amount is $6,500 at a mean rate of 19.28%, with a mean monthly payment of $272.90. Mean credit card utilization is 56.1% and the mean number of delinquencies over the last 7 years is over 4; however, the mode for both variables was actually 0. The most common Debt to Income Ratio was approximately 0.20.

Main Features of the Dataset

The main features of the dataset are the Credit Score, Prosper Credit Rating and the Loan Interest Rate. Much of the analysis will be focused on how each is derived, how related they are to each other and how each affects or is correlated to other loan characteristics and loan distribution.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I suspect that Term, DebtToIncomeRatio, LoanOriginalAmount, Credit Score and Credit Rating may factor into BorrowerRate for each loan. I will be interested to see how correlated with each Borrower’s credit score the “credit history” variables (FirstRecordedCreditLine, TotalCreditLinespast7years, InquiriesLast6Months, DelinquenciesLast7Years) and “current credit health” variables (OpenCreditLines, OpenRevolvingAccounts, OpenRevolvingMonthlyPayment, RevolvingCreditBalance, BankcardUtilization, AvailableBankcardCredit, DebtToIncomeRatio) will be. Finally, I am hoping to investigate Investors further to see if any interesting trends arise.

Did you create any new variables from existing variables in the dataset?

I created five new variables. First, I addressed the CreditGrade and ProsperRating..Alpha. variables. Prosper’s credit rating appears to have undergone a change in May and June 2009. These rating system variables were combined into the variable ProsperCreditGrade for analysis across the entire dataset. However, in order to compare the two rating systems, I also created the variable ProsperRatingSystem, which indicates whether a particular grade comes from the new or old formula. Any loan that was not issued a rating was removed.

Next, I created avgCredit, which averages CreditScoreRangeLower and CreditScoreRangeUpper for easier analysis, and statedIncomeBucket, which splits StatedMonthlyIncome at 4,750.

Finally, in 2013 Prosper changed its source of credit scores from Experian ScoreX to FICO08, so I created the variable CreditSystem, which indicates which provider was used when the credit score was pulled. I would like to investigate how lending on the site was affected by this change, including the number of loans issued per credit score as well as investor behavior. Source: http://blog.prosper.com/2013/09/08/prosper-system-upgrade-this-weekend/

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Before performing any analysis, any loan with no credit grade from Prosper was removed from the dataset and the original variables with these ratings were removed as they are redundant.

Due to the number of large (or small) outliers in the dataset across a number of variables, nearly all of the plots required limit setting in order to see the bulk of the data in any meaningful way. Examples of this are Credit Score, Inquiries in the last 6 months, Bankcard Utilization, Debt to income ratio and Investors.

Because the highest number of borrowers have zero delinquencies over the last 7 years and 0% bankcard utilization, I square-root transformed the scales in order to see more definition in the rest of the data. I did the same thing with the Monthly Loan Payment and Investors plots.
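
The two adjustments described above (limiting the plotted range and square-root scaling the counts) can be sketched in base R as below. The vector is invented, and the 95th percentile cut-off is just one plausible way to pick a limit; it is not necessarily how the limits in my plots were chosen.

```r
# Hypothetical delinquency counts: mostly zeros with a long right tail.
delinq <- c(rep(0, 50), rep(1, 8), rep(2, 5), 3, 3, 7, 14, 99)

# Limit setting: only plot values at or below the 95th percentile.
upper   <- quantile(delinq, 0.95, na.rm = TRUE)
plotted <- delinq[delinq <= upper]

# Square-root scaling: compresses the dominant zero bar so the smaller
# bars remain visible next to it.
counts      <- table(plotted)
sqrt_counts <- sqrt(counts)

rbind(raw = counts, sqrt = round(sqrt_counts, 2))
```

On the raw scale the zero bar is 25 times taller than the smallest bar; after the transformation it is only 5 times taller. If the plots are drawn with ggplot2, `scale_y_sqrt()` applies the same idea directly to the axis.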

Finally, while I didn’t actually change any data, I separated the Bankcard Utilization plots by whether or not the Prosper loan was a debt consolidation loan. Presumably, many of these loans were intended to pay down credit card debt and would thus skew analysis on the population’s utilization.

Bivariate Plots

Credit Scores (“ScoreX”):

    9.5  369.5  429.5  449.5  469.5  489.5  509.5  529.5  549.5  569.5
      3      1      5     36    141    346    554   1593   1474   1357

  589.5  609.5  629.5  649.5  669.5  689.5  709.5  729.5  749.5  769.5
   1125   3600   4168   8655  10213  10534   9998   9083   7378   5582

  789.5  809.5  829.5  849.5  869.5  889.5
   4045   2338   1300    545    211     27

Credit Scores (“FICO08”):

  649.5  669.5  689.5  709.5  729.5  749.5  769.5  789.5  809.5  829.5  849.5
   3530   6135   5950   5459   3815   1871   1011    575    303    106     18

Interestingly, the number of loans given out by Credit Grade follows a somewhat normal distribution, with “C” or “D” graded borrowers receiving the highest number of loans, depending on the rating configuration of the site. I am curious whether this is by design, with Prosper grouping a broader subset of applicants into this category in an attempt to really highlight those on the more extreme ends of the scale, or if it simply reflects the type of applicants the site is getting. Presumably, better qualified borrowers have more and potentially better options than what Prosper offers. Conversely, less qualified borrowers are likely not getting approved for as many loans on the site as their slightly better qualified counterparts.

When separating by Credit Score System and Prosper Rating System, the distributions become less normal, but the middle rated borrowers still account for the largest numbers of issued loans from the website. We do see that the percentage of loans going to “D”, “E”, and “HR” relative to the other rated loans seems to drop after the switch to FICO08. This could be due to a stricter rating process on Prosper’s end or due to Investors being more discriminating with the improved information.

This led me to wonder if it would be possible to determine if changes in Investor behavior were caused by this change. This will be investigated further with a multivariate plot.


Percentage of Loans with one investor: 24.47


Correlation Coefficient of Investors and Prosper Credit Grade: -0.3476

As we saw in the previous section, a large number of loans have only one investor. We can see this influence when plotting Investors by Credit Grade. The first quartile of A, B and C rated Borrowers is at or near one.



New Debt To Income Ratio Stats:

     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA’s
0.0001859   0.1875   0.2737   0.3115   0.3843    13.95     9795

Naturally, those applying for Debt Consolidation loans had a higher debt to income ratio prior to acceptance of the new loan. However, while borrowers of these loans likely remained at a similar debt to income level after securing the consolidation loan and presumably paying off other debt, non consolidation loan borrowers saw an increase in debt to income ratio, which was also to be expected.


Correlation Coefficient of BorrowerRate and LoanOriginalAmount: -0.3296

Correlation Coefficient of BorrowerRate and Prosper Credit Grade: 0.8792

BorrowerRate and LoanOriginalAmount are only weakly correlated (-0.33), whereas BorrowerRate and Prosper Credit Grade are highly correlated (0.88), as the statistic, the separation in the plot and the website would suggest. Yet, this does not tell the whole story. Previous iterations of this plot suggested that BorrowerRate went down as LoanOriginalAmount went up. Separating by Credit Grade reveals that interest rates are generally flat within a Credit Grade group, with the exception of the lower rated borrowers, who have greater variability. We see that BorrowerRate gets lower as Loan Amount goes up for the worst rated group. I suspect that this is because the only “HR” borrowers eligible for higher loan amounts have something in their credit profile that affords them more eligibility than the average “HR” borrower. For example, these higher loan amount “HR” borrowers may have better credit scores than the average “HR” borrower, giving them a lower interest rate, but have a large number of delinquencies in the past 7 years, which relegates them to the “HR” category.
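
A small sketch of the pooled versus within-grade comparison described above. The data are invented, and treating the ordered grade as its integer position (AA = 1 through HR = 7) is an assumption about how a factor can be correlated with a numeric variable, made for illustration.

```r
# Hypothetical toy data: flat rates within "A", falling rates within "HR".
toy <- data.frame(
  ProsperCreditGrade = factor(rep(c("A", "HR"), each = 4),
                              levels = c("AA", "A", "B", "C", "D", "E", "HR")),
  LoanOriginalAmount = c(5000, 10000, 15000, 20000, 2000, 4000, 6000, 8000),
  BorrowerRate       = c(0.10, 0.11, 0.11, 0.10, 0.34, 0.32, 0.29, 0.26)
)

# ASSUMPTION: encode the ordered grade by its level position.
grade_num <- as.integer(toy$ProsperCreditGrade)

# Pooled correlations across all loans.
overall_grade  <- cor(toy$BorrowerRate, grade_num)
overall_amount <- cor(toy$BorrowerRate, toy$LoanOriginalAmount)

# Correlation between rate and amount computed separately per grade.
within_grade <- sapply(
  split(toy, toy$ProsperCreditGrade, drop = TRUE),
  function(g) cor(g$BorrowerRate, g$LoanOriginalAmount)
)
within_grade
```

In this toy example the pooled rate-versus-amount correlation is negative, yet within grade "A" it is essentially zero. The pooled figure mostly reflects which grades receive which loan sizes.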


Correlation Coefficient of BorrowerRate and Term: 0.0201


Correlation Coefficient of BorrowerRate and DebtToIncomeRatio: 0.06296

Clearly, the Prosper Credit Rating has a much bigger effect on the loan’s interest rate than the original loan amount, term or the borrower’s debt to income ratio.


Correlation Coefficient of BorrowerRate and avgCredit: -0.4871


It looks like Credit Score has a similar effect on loan interest rate to Prosper Credit Grade. One would assume that the two must be closely related.

Correlation Coefficient of ProsperCreditGrade and avgCredit with the old rating system and ScoreX: -0.9788

Correlation Coefficient of ProsperCreditGrade and avgCredit with the new rating system and ScoreX: -0.5823

Correlation Coefficient of ProsperCreditGrade and avgCredit with the new rating system and FICO08: -0.5927

From the plots and the correlations above, I believe the original Prosper Credit Rating may have been a simple aggregation of credit scores into easier to read letter grades. After the new rating formula was implemented, credit score was clearly heavily involved, but no longer the sole factor in assigning a Credit Rating.
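
The three correlations above come from computing the same statistic on separate subsets. A minimal sketch of that grouping step, on invented numbers (the `grade_num` column is a hypothetical numeric encoding of ProsperCreditGrade, with 1 = AA and 7 = HR):

```r
# Hypothetical toy data: one block of four loans per rating/score combination.
toy <- data.frame(
  ProsperRatingSystem = rep(c("Old Rating System", "New Rating System",
                              "New Rating System"), each = 4),
  CreditSystem        = rep(c("ScoreX", "ScoreX", "FICO08"), each = 4),
  grade_num           = rep(c(1, 3, 5, 7), times = 3),
  avgCredit           = c(789.5, 709.5, 629.5, 549.5,
                          769.5, 709.5, 729.5, 669.5,
                          809.5, 729.5, 749.5, 689.5)
)

# Split by both system variables, dropping combinations that never occur
# (the old rating system was never paired with FICO08).
groups <- split(toy, list(toy$ProsperRatingSystem, toy$CreditSystem),
                drop = TRUE)

# One correlation per subset.
r_by_group <- sapply(groups, function(g) cor(g$grade_num, g$avgCredit))
round(r_by_group, 3)
```

In the toy data the old-system grades are an exact linear function of score, so the correlation is -1, while the new-system subsets are negative but weaker. That mirrors the pattern in the real figures.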


Looking at this from another angle, I wonder what the mean credit score is for a given interest rate.

Consistent with the previous findings, those with higher credit scores qualify for better interest rates. This is by design and controlled by the website. It was interesting however to see an uptick in mean and median credit scores at the very high end of interest rates. I wonder if this is due to the fact that the higher interest rates are associated with either much larger original loan amounts or much longer terms that only better qualified borrowers would be approved for.
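
Roughly, the "mean credit score for a given interest rate" view can be built by binning the rate and averaging the score within each bin. The sketch below uses invented values and an arbitrary bin width of 0.1; the actual plot may have used finer bins.

```r
# Hypothetical toy data with an uptick in score at the highest rates.
toy <- data.frame(
  BorrowerRate = c(0.06, 0.09, 0.14, 0.18, 0.22, 0.27, 0.31, 0.34),
  avgCredit    = c(789.5, 769.5, 729.5, 709.5, 689.5, 669.5, 689.5, 709.5)
)

# Bin the interest rate into (0,0.1], (0.1,0.2], (0.2,0.3], (0.3,0.4].
toy$RateBin <- cut(toy$BorrowerRate, breaks = seq(0, 0.4, by = 0.1))

# Mean credit score within each rate bin.
mean_by_bin <- tapply(toy$avgCredit, toy$RateBin, mean)
mean_by_bin
```

The last bin's mean is higher than the one before it, which is the kind of uptick discussed above.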

Term Summary of Loans with Borrower Rate Greater Than or Equal to 0.29:

     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
       36       36       36    37.87       36       60

Term Summary of Loans with Borrower Rate Less Than 0.29:

     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
       12       36       36    41.34       36       60

Original Loan Amount Summary of Loans with Borrower Rate Greater Than or Equal to 0.29:

     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     1000     2500     4000     3910     4000    25000

Original Loan Amount Summary of Loans with Borrower Rate Less Than 0.29:

     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     1000     4000     7500     9091    13500    35000

Number of Loans with Borrower Rate between 0.275 and 0.3: 7452

Number of Loans with Borrower Rate Between 0.3 and 0.325: 9257

I was quite wrong in my estimate for the reason behind why the average credit score is higher for those borrowers of higher interest rate loans. I even surmised that the lack of relative volume of loans above 30% could be skewing the numbers, but there were even more loans with rates from 30-32.5% than from 27.5-30% interest.


Correlation Coefficient of Investors and BorrowerRate: -0.274

Correlation Coefficient of Investors and BorrowerRate on loans with more than 1 investor: -0.4176

As we have seen previously, a large number of loans with only one investor may be affecting the data. Here, the correlation between Investors and BorrowerRate strengthened from -0.27 to -0.42 when excluding the nearly 25% of loans with only one investor. I am curious if this is true throughout the period of time covered by the dataset.


An explosion of One-Investor loans occurs in early 2013 and I wonder why. To see it reach levels of over 4000 loans per month when the previous high was just over 200, is shocking to say the least. I want to see if we can investigate investor behavior further by looking at investor relationships with other variables.
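
A monthly count like the one plotted above can be produced by truncating each date to its year and month and tabulating the one-investor loans. The sketch uses invented rows, and using LoanOriginationDate (rather than DateCreditPulled) as the time axis is an assumption made for the example.

```r
# Hypothetical toy data in the same date format as the dataset.
toy <- data.frame(
  LoanOriginationDate = c("2012-11-03 00:00:00", "2012-11-20 00:00:00",
                          "2013-02-14 00:00:00", "2013-02-15 00:00:00",
                          "2013-02-28 00:00:00", "2013-03-01 00:00:00"),
  Investors           = c(35, 1, 1, 1, 80, 1),
  stringsAsFactors    = FALSE
)

# Reduce each timestamp to a "YYYY-MM" label (the time part is ignored).
toy$Month <- format(as.Date(toy$LoanOriginationDate, format = "%Y-%m-%d"),
                    "%Y-%m")

# Count one-investor loans per month.
one_investor <- subset(toy, Investors == 1)
per_month    <- table(one_investor$Month)
per_month
```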


Correlation Coefficient of Investors and avgCredit: 0.2831

While the plot suggests some relationship, Investors per loan is not very highly correlated with credit score. My guess is that the plot appears to show a stronger relationship because of the high number of one-investor loans in the 650-750 credit score range.


Correlation Coefficient of Investors and DebtToIncomeRatio: 0.004068

Correlation Coefficient of Investors and LoanOriginalAmount: 0.3803

Investors do not seem to factor Debt to Income ratio into investment decisions at all, with a correlation that low. It does however seem that higher loan amounts are likely to attract a higher number of investors. I suspect that this is due to investors wanting to spread the risk of default.


The following plots will be faceted for the purpose of separating the data only, rather than for comparing the data among the faceted variable.

Correlation Coefficient of avgCredit and TotalCreditLinespast7years: 0.1018

Correlation Coefficient of avgCredit and InquiriesLast6Months: -0.271

It would seem that neither credit inquiries in the last 6 months nor total credit lines in the last 7 years is a major factor in determining credit score.


Delinquencies Summary:

     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA’s
        0        0        0    4.159        3       99      919

Number of borrowers with zero delinquencies in the last 7 years: 76281

Number of borrowers with at least one delinquency in the last 7 years: 36465

Percentage of borrowers with Zero or One delinquency in the last seven years: 70.6 %

It is interesting to see that those borrowers with only zero or one delinquency in 7 years (which make up the majority of the dataset) have a measurably higher average credit score than those with more delinquencies. Borrowers with 3-50 delinquencies have approximately the same median credit score and borrowers with approximately 50-80 delinquencies in the last seven years generally have the same median credit score. This suggests that credit rating agencies group people when factoring this statistic into credit scores.
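
One way to check for this kind of grouping is to band the delinquency counts and compare the median credit score per band. The sketch below uses invented values, and the band edges are hypothetical choices loosely based on the ranges mentioned above.

```r
# Hypothetical toy data; the last row has a missing delinquency count.
toy <- data.frame(
  DelinquenciesLast7Years = c(0, 0, 1, 2, 4, 10, 35, 60, 75, NA),
  avgCredit = c(729.5, 709.5, 709.5, 669.5, 649.5,
                649.5, 649.5, 629.5, 629.5, 689.5)
)

# Illustrative bands: 0-1, exactly 2, 3-50, and more than 50 delinquencies.
toy$DelinqBand <- cut(toy$DelinquenciesLast7Years,
                      breaks = c(-Inf, 1, 2, 50, Inf),
                      labels = c("0-1", "2", "3-50", "51+"))

# Median credit score per band (rows with a missing band are ignored).
med_by_band <- tapply(toy$avgCredit, toy$DelinqBand, median)
med_by_band
```

A flat median across a wide band, followed by a step down at the next band, is the pattern that would suggest borrowers are being grouped.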


Correlation Coefficient of avgCredit and DelinquenciesLast7Years with “ScoreX”: -0.2596

Correlation Coefficient of avgCredit and DelinquenciesLast7Years with “FICO08”: -0.3072

While there appears to be a downward trend in average credit score as delinquencies go up, the correlation between the two variables is weak, and correlation alone would not establish causation in any case. The biggest difference in these numbers is between those with zero or one delinquency and all others.


OpenCreditLines Summary:

     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA’s
        0        6        9    9.261       12       54     7463

The most notable feature of the plot above is that those borrowers with very few open lines of credit actually have lower average credit scores, suggesting that some debt is actually good. I wonder if each credit scoring system treats open credit lines the same.

Interestingly, the two systems differ slightly when it comes to open credit lines. While ScoreX slightly rewards those with more open credit lines than fewer, FICO08 shows slightly lower average credit scores for those borrowers with 3-5 open credit lines as compared to both those with none and 6+ alike.


OpenRevolvingAccounts Summary:

     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
        0        4        6    6.979        9       51

The effect of OpenRevolvingAccounts on credit score is very similar to that of OpenCreditLines, which makes sense as revolving accounts are likely included in all credit lines. Oddly though, we don’t see a similar effect on credit score with revolving accounts under FICO08 as we did with all credit lines. It appears that under this calculation, the number of open revolving accounts is not factored into the credit score much, if at all.


Correlation Coefficient of OpenRevolvingMonthlyPayment and RevolvingCreditBalance: 0.761

As expected, OpenRevolvingMonthlyPayment and RevolvingCreditBalance are highly correlated, with much of the variance having to do with interest rate and how the debt is spread among open accounts.


Correlation Coefficient of BankcardUtilization and AvailableBankcardCredit: -0.3507

Correlation Coefficient of OpenRevolvingMonthlyPayment and BankcardUtilization: 0.2979

Having some available credit is better than none, but there are diminishing returns after $10,000-$20,000 of available credit.

When comparing many of the credit metrics to credit score, it would appear once again that some debt is good. The global peak (or at least a local peak) in credit score is often around the first quartile of many of the metrics, such as OpenRevolvingMonthlyPayment, OpenRevolvingAccounts, BankcardUtilization and DebtToIncomeRatio. While the faceting above was simply intended to account for the fact that each system derives scores slightly differently, some really interesting differences were shown. Each model seems to factor Debt To Income Ratio and Open Credit Accounts very differently.


Correlation Coefficient of AvailableBankcardCredit and BankcardUtilization: -0.3507

Sadly, given that other variables have an effect on the plot above, I struggled to find anything meaningful or interesting.


Correlation Coefficient of DelinquenciesLast7Years and BankcardUtilization: -0.02973

While I would have expected those with more delinquencies to have higher Bankcard Utilization (both indicators of poor credit management), it seems that the two variables are not related.


I expected homeowners to have more delinquencies, as they likely have a larger debt burden than most non-homeowners, but they in fact have fewer. This is a more nuanced discussion, however. Are those with higher delinquencies less likely to own a home due to poorer credit management? Does owning a home condition a person to pay bills on time? This definitely requires more investigation.


Correlation Coefficient of avgCredit and DelinquenciesLast7Years: -0.2627

Delinquencies and average credit score are not highly correlated and after plotting delinquencies by Prosper Credit Grade, I see why. Because of the number of borrowers with zero delinquencies in the last 7 years, the median number of delinquencies across all grades is zero, which skews any potential correlation. The clear difference we see is in the third quartiles of the poorer credit grades.


While delinquencies are not highly correlated with credit score, it does seem to be something that is factored into human decisions about credit viability. There is a clear decline in average number of open non-revolving lines of credit as delinquencies increase. As non-revolving lines of credit are typically approved by an underwriter who would look at more factors than just a credit score, it is reasonable to infer that this decline we see is deliberate.


Similarly, we see that fewer investors are likely to invest in a single loan if the borrower has a higher number of delinquencies.


Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The analysis revealed two major shifts in the dataset that affected a number of the variables throughout. First, in 2009 Prosper changed its proprietary, letter-based credit rating system. Under its old system, each letter grade was almost equally likely to have been assigned, with the most frequent grades being “B”, “C” and “D”. Under the new system, the distribution is more normal, and while “C” is still the most frequently assigned grade, the highest and lowest grades, “AA” and “HR”, were assigned much less frequently than all other grades.

Second, in 2013 Prosper migrated its source of credit scores from Experian ScoreX to the more widely used FICO08. The credit scores in each source are calculated in very different ways, so it is almost impossible to analyze scores across the entire dataset without separating by the source used to create the score. The dataset places borrowers into groups by credit score, and while the groups in each source are generally 20 points apart, the old source gave scores all the way from 0 to 899, whereas the new source gave scores from 640 to 859. It is highly unlikely that Prosper simply decided to limit borrowers to that credit score range after the switch, so we can assume that it is just a fundamentally different rating system.

The original Prosper grading system appears to have relied heavily on credit score. Under the new rating system, borrowers with higher credit scores generally have higher Prosper grades, but the system seems to be much more nuanced, factoring other variables and not just credit score. Regardless of rating system, borrowers with higher grades garner more investors per loan, on average.

A loan’s amount and term do not appear to have an appreciable effect on the loan’s interest rate. Debt-to-income ratio seems to be factored in slightly, but Credit Score and Prosper Credit Grade are the highest correlated with interest rate. While those with lower credit scores generally have higher interest rate loans, the average credit score among those with the highest interest rate loans (>30%) is actually slightly higher than those with lower interest rate loans around 30%.

Credit scores are likely calculated using many of the “credit history” and “credit health” variables that exist in this dataset, but outside of OpenRevolvingMonthlyPayment, I was unable to find many strong correlations. However, I did notice an interesting phenomenon when comparing many of the variables to credit score, where a local maximum would present itself around the first quartile of the variable. These variables are all indicators of current debt or credit utilization, suggesting that carrying a small amount of debt may actually lead to slightly higher credit scores than would be expected.

Finally, while I faceted some of the plots above by Credit Score Source in order to more accurately analyze a particular variable’s effect on the score, some interesting differences in the models were shown. For instance, each model seems to factor Debt To Income Ratio and Open Credit Accounts very differently.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I was fascinated that nearly 1/3 of borrowers in the dataset had at least one delinquency in the last 7 years. Among those with at least one, the average number of open non-revolving credit lines declines as the number of delinquencies in the last 7 years goes up. While this may be due to better credit management, my guess is that those with a greater number of delinquencies are less likely to be approved for a non-revolving line of credit. This theory can be defended by looking at the number of investors per loan compared to number of delinquencies. Those borrowers with less than five delinquencies are much more likely to receive a higher number of investors than those with more than five.

On the topic of investors, it was also interesting to see the median investors per loan over time. There appears to have been a huge shift in early 2013 driving down median investors per loan to as low as 1 in 2014. This is confirmed when we look at the number of “One Investor Loans” per month and see massive increases in 2013 and 2014.
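
The monthly figures described above could be computed along the following lines. This is only a sketch on made-up data: the `toy` data frame, the `month` column and the specific values are assumptions for illustration, and in the real dataset LoanOriginationDate would first need to be converted to a date.

```r
# Toy data standing in for the loans (values are invented for illustration).
toy <- data.frame(
  LoanOriginationDate = as.Date(c("2012-11-05", "2012-11-20", "2012-11-28",
                                  "2013-06-03", "2013-06-15", "2013-06-30")),
  Investors = c(40, 85, 120, 1, 1, 60)
)

# Bucket each loan into a year-month string.
toy$month <- format(toy$LoanOriginationDate, "%Y-%m")

# Median investors per loan for each month.
median_by_month <- sapply(split(toy$Investors, toy$month), median)

# Number of loans funded by exactly one investor in each month.
one_investor_by_month <- sapply(split(toy$Investors == 1, toy$month), sum)

print(median_by_month)
print(one_investor_by_month)
```

Plotting either vector against month gives the kind of time series discussed above.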

What was the strongest relationship you found?

By far, the strongest relationship I found was that of Credit Scores and Prosper Credit Grade with the old rating system. With a correlation coefficient of -0.98, it is very likely that the rating system was simply an aggregation of a group of borrowers based on credit score alone. In fact, there are only two instances where a credit grade’s minimum credit score matches that of the lower grade’s maximum.
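
As a rough sketch of how such a coefficient can be obtained, the grade can be converted to an integer rank and correlated with a numeric credit score. The toy data, the `gradeRank` column and the definition of `avgCredit` as the midpoint of the reported score range are assumptions made for this example, not necessarily what the analysis used.

```r
# Toy data standing in for old-rating-system loans (values are made up).
toy <- data.frame(
  CreditGrade = c("AA", "A", "B", "C", "D", "E", "HR"),
  CreditScoreRangeLower = c(760, 720, 680, 640, 600, 560, 520),
  CreditScoreRangeUpper = c(779, 739, 699, 659, 619, 579, 539)
)

# Assumed definition: midpoint of the reported credit score range.
toy$avgCredit <- (toy$CreditScoreRangeLower + toy$CreditScoreRangeUpper) / 2

# Rank grades from best (1) to worst (7) so they can be treated numerically.
grade_levels <- c("AA", "A", "B", "C", "D", "E", "HR")
toy$gradeRank <- as.integer(factor(toy$CreditGrade, levels = grade_levels))

# Pearson correlation between grade rank and credit score.
grade_score_cor <- cor(toy$gradeRank, toy$avgCredit)
print(grade_score_cor)
```

The sign depends only on the direction of the ranking: with the best grade coded as 1 the coefficient is negative, and reversing the coding flips it to positive without changing its magnitude.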


Multivariate Plots

Looking at this plot alone, it is easy to be confused as to why the distributions are so different between the two credit scoring systems. What caused investor behavior to change so much after the change to FICO08?

Investor Summary for “ScoreX” Loans:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      1      24      62   96.94     135    1189

Investor Summary for “FICO08” Loans:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      1       1       1    31.8       7     714

Investors per Day in the “ScoreX” Era: 2879

Investors per Day in the “FICO08” Era: 4972

As we have seen previously, there was a massive explosion of one investor loans in early to mid-2013. The switch to FICO08 occurred in late 2013, so it is unlikely to be the cause of the major shift in investor behavior. Further skewing the plot is the fact that the site saw much greater activity in the FICO08 era than before the change. It is no wonder we see such a disparity in the median and mean investors per loan during this time.
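
The gap between median and mean is easy to reproduce on a small example. The sketch below uses invented numbers; the `scoreSource` column and the definition of "investors per day" as total investors divided by the number of distinct origination dates are assumptions made for illustration.

```r
# Toy loans: a few heavily crowd-funded "ScoreX" loans and mostly
# single-investor "FICO08" loans (values are invented).
toy <- data.frame(
  scoreSource = c("ScoreX", "ScoreX", "ScoreX",
                  "FICO08", "FICO08", "FICO08", "FICO08"),
  Investors = c(24, 62, 135, 1, 1, 1, 7),
  LoanOriginationDate = as.Date(c("2012-01-01", "2012-01-02", "2012-01-02",
                                  "2014-01-01", "2014-01-01",
                                  "2014-01-02", "2014-01-02"))
)

by_source <- split(toy, toy$scoreSource)

# Median and mean investors per loan in each era.
median_investors <- sapply(by_source, function(d) median(d$Investors))
mean_investors <- sapply(by_source, function(d) mean(d$Investors))

# Assumed definition: total investors divided by number of distinct days.
investors_per_day <- sapply(by_source, function(d) {
  sum(d$Investors) / length(unique(d$LoanOriginationDate))
})

print(rbind(median_investors, mean_investors, investors_per_day))
```

When most loans have a single investor, the median stays at one while a handful of widely funded loans pull the mean upward.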


Consistently, we see that the median investors per loan is one for nearly every credit score. Given the differences we’ve seen between the median and mean, I am curious to see what the distribution of means by credit score would look like.


We do again see that mean investors per loan is down in the FICO08 era as compared with the ScoreX era. However, it is good to see a definitive increase in mean investors per loan as credit scores go up. This is to be expected and likely due to higher investor confidence.

Percentage of borrowers with a credit score above 825: 1.942 %

With such a low percentage of borrowers with a credit score above 825, it is likely that the decline we see in the plot above is either related to some anomaly in the data or a regression to a more appropriate level of investors per loan.


Loans Per Term:

   12    36    60
 1613 87507 24545

Faceting the loans by term, we see that loan amounts from $3500-6000 command the highest interest rates in their group, with slightly higher and slightly lower loan amounts commanding lower rates. Variability in interest rates per original loan amount seems to vary by term as well, even though some of the median rates in each original loan group do not seem to follow a pattern term to term.


We would expect to see a good deal of linearity when comparing Loan Amount to Monthly Payment. Obviously the majority of the variance in this case is due to the interest rate. The 36 month loans do seem to have a higher variance than the other loans, so it will be interesting to see if this holds true.

12 Month Loan APR Variance: 0.008147

12 Month Loan APR Summary:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.04935  0.1466  0.2203  0.2162  0.2917  0.3584

36 Month Loan APR Variance: 0.007312

36 Month Loan APR Summary:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.00653  0.1486  0.2097  0.2195  0.2926  0.5123

60 Month Loan APR Variance: 0.003304

60 Month Loan APR Summary:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.07111  0.1718  0.2093  0.2168  0.2572  0.3584

Surprisingly, 12 month loans had greater variability than 36 or 60 month loans.
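
A per-term variance like the ones reported above can be computed in one step by splitting APR on Term. The data below are invented and only illustrate the mechanics.

```r
# Toy loans with three terms (APR values are invented).
toy <- data.frame(
  Term = c(12, 12, 12, 36, 36, 36, 60, 60, 60),
  BorrowerAPR = c(0.10, 0.20, 0.30, 0.15, 0.20, 0.25, 0.18, 0.20, 0.22)
)

# Sample variance of APR within each term group.
apr_var_by_term <- sapply(split(toy$BorrowerAPR, toy$Term), var)
print(round(apr_var_by_term, 4))

# Term with the greatest APR variance.
most_variable_term <- names(which.max(apr_var_by_term))
print(most_variable_term)
```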


Given that we know how correlated interest rate and Prosper Credit Grade are, it is not surprising to see how similar these plots appear. This plot however gives a clearer view of the separation between the groups of borrowers on the 36 and 60 month loans.


Among Debt Consolidation Borrowers, What is the Relationship Between Open Credit Lines and Debt to Income Ratio?

I would expect that as Open Credit Lines increases, so too will Debt to Income Ratio. However, even with jittering and changing the alpha level of the points, I don’t feel that this plot gives us an idea of the relationship between the two. My sense is that a boxplot would do a better job of this.

With this plot we can see that non-homeowners generally have higher debt to income ratio per open credit line than homeowners.


Correlation Coefficient of MonthlyLoanPayment and BorrowerAPR: -0.2274

Given how integral a loan’s interest rate is to its monthly payment, I would have expected a higher correlation between the two. Obviously a loan’s term and original amount are much bigger factors in the calculation of a monthly payment. The plot above has some really interesting discreteness that shows what we would expect, which is lines representing common original loan amounts moving in the plot as the interest rate affects the ultimate monthly payment. The most interesting thing to me however is how the pitch and variability of these lines change with each Prosper Credit Grade group and as the loan amount increases. First, there is generally much less variability in potential interest rates among higher graded borrowers. The lowest graded group has interest rates ranging from 5% to over 40%, though they are much less likely to be approved for higher loan amounts. Next, in each Prosper Credit Grade facet, we see that the pitch of the discrete lines sharpens as the original loan amount increases. This is because increases in the interest rate will cause larger increases in monthly payments for larger loans. This is not surprising, but still interesting to see plotted.
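
The effect of loan size on the slope of those lines follows from the standard fixed-rate amortization formula. The sketch below assumes monthly compounding of a nominal annual rate; the source does not state how Prosper actually computes payments, and the function name `monthly_payment` is made up for this example.

```r
# Standard fixed-payment amortization formula (assumed, not taken from Prosper):
# payment = P * r / (1 - (1 + r)^(-n)), with r the monthly rate and n the months.
monthly_payment <- function(principal, annual_rate, term_months) {
  r <- annual_rate / 12
  principal * r / (1 - (1 + r)^(-term_months))
}

rates <- c(0.10, 0.20, 0.30)

# Payments for a small and a large 36 month loan across the same rates.
small_loan <- monthly_payment(4000, rates, 36)
large_loan <- monthly_payment(20000, rates, 36)

# How much the payment moves between the lowest and highest rate.
payment_change_small <- diff(range(small_loan))
payment_change_large <- diff(range(large_loan))

print(round(rbind(small_loan, large_loan), 2))
print(round(c(payment_change_small, payment_change_large), 2))
```

Because the payment is proportional to the principal, the same change in rate moves the payment of a $20,000 loan five times as far as that of a $4,000 loan. As a quick check of the formula, `monthly_payment(1200, 0.12, 12)` comes to roughly 106.62.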


There are a few interesting things to look at in this plot. First, we see visually the difference in average credit score assigned to each Prosper Credit Grade under the old and newer rating systems. Given what we know about the differences in scores issued under ScoreX and FICO08, I would have expected to see a much bigger difference between the two under the same Prosper Grading system. Next, while it does not seem that debt to income ratio was factored heavily into credit under ScoreX, we see some odd trends in the FICO08 data. The credit scores of the top three highest graded groups seem to increase with a modest amount of credit utilization, but drop sharply above 50% debt to income ratio. The scores of lower graded borrowers (“E” and “HR”), however, see more consistent improvements to credit score with increases in credit utilization, which is counter-intuitive.


I wonder if we will see different results when splitting this by loan amount.

Given that investors make money on the interest of loans, I would have expected to see more investors per loan as interest rates increased. That we essentially see the opposite on loans with an interest rate above 10% is indicative of the fact that higher risk borrowers receive higher interest loans, which make them less desirable to investors.


In all instances except the lowest credit score range, the new credit score source meant lower mean interest rates and less variability within each score group. Given we know that the scores from each of these systems are calculated very differently and have a different range of possibilities, I believe that the variance we see is due to different populations occupying the score ranges, rather than the new system being used by Prosper to issue lower interest rates for the same credit score range.


Correlation Coefficient of avgCredit and LoanOriginalAmount: 0.3522

A borrower’s credit score obviously does not dictate the size of the loan they receive, but it does appear that borrowers with higher credit scores are afforded the opportunity to receive a wider range of loan amounts than borrowers with lower scores. For example, the median loan for borrowers with a credit score lower than 630 was less than $4,000, whereas borrowers with higher credit scores received median loans as high as $15,000. The variance around this median generally increases with credit score.


It is interesting to see the change in variability of interest rates and credit scores among each credit rating group. Whereas the group rated “AA” generally all have high credit scores and low interest rates, “C” and “D” rated borrowers have much more variability in both metrics. “HR” rated borrowers are generally relegated to the highest interest rates, even though some in this group have credit scores similar to those in the “AA” group.


Regardless of credit score, a borrower’s interest rate typically increases as their Prosper Credit Grade decreases.


While loans with borrowers that have higher Debt to Income Ratios generally attract fewer investors per loan, oddly, Non-Debt Consolidation loans do not suffer this decline as drastically. As we have seen, those borrowers securing debt consolidation loans have a higher debt to income ratio at the time of application. One would assume that a high debt to income ratio would not affect investor behavior on these loans as much as with other loans, yet it does. One possible reason for this decline is that investors may actually feel more confident in the Debt Consolidation Loans in general and thus are comfortable investing a higher percentage of the original loan amount.


I wanted to find out how differently each credit score provider factored delinquencies into the score calculation. We previously found that the biggest difference in credit score with regard to delinquencies was between borrowers with zero or one delinquency and everyone else. We see this again here, but interestingly, it appears that scores under the FICO08 system take a much bigger hit with 2 or more delinquencies. With this system, not only are the highest scores reserved for those with little to no delinquencies, those with 10 or more in the last seven years have credit scores that are effectively maxed out between 700 and 725 and much more likely to be in the 600’s. On the flip side, borrowers with more than 30 delinquencies under the ScoreX system have scores ranging from 550 to over 800!


Correlation Coefficient of avgCredit and OpenRevolvingMonthlyPayment: 0.1373

I initially wanted to see if open credit monthly payments negatively affected credit score, but quickly realized that someone making $1000/month would be much more affected by a $500 monthly credit card bill than would someone making $15,000/month. For the plot above, I split the data into the top half and bottom half of earners by stated monthly income. I expected to see a significant decline in average credit score among the bottom half of earners as the actual dollar amount of monthly revolving payment went up. Oddly, it stayed relatively flat and even matched the trend of the top half earners. I suspect that this is not something factored into score as much as it is factored into loan eligibility and underwriting.
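
The split into top and bottom earners described above can be sketched as follows. The toy values, the `earnerHalf` column and the choice to place incomes exactly at the median in the bottom half are assumptions for illustration.

```r
# Toy borrowers (incomes and scores are invented).
toy <- data.frame(
  StatedMonthlyIncome = c(1000, 2500, 4000, 6000, 9000, 15000),
  avgCredit = c(660, 680, 700, 690, 720, 740)
)

# Split at the median stated monthly income.
income_median <- median(toy$StatedMonthlyIncome)
toy$earnerHalf <- ifelse(toy$StatedMonthlyIncome > income_median,
                         "Top half", "Bottom half")

# Mean credit score within each half.
mean_score_by_half <- sapply(split(toy$avgCredit, toy$earnerHalf), mean)
print(mean_score_by_half)
```

The `earnerHalf` column can then be used as a facet or color variable when plotting credit score against OpenRevolvingMonthlyPayment.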


When looking at the plot, 4 facets stand out to me as different than the rest. #4 (Personal Loan), #8 (Baby & Adoption), #10 (Cosmetic Procedure), #11 (Engagement Ring) are the four that I think call out for further investigation. While the comedian in me wants to show the variability or positive relationship in these facets as consistent with other poor life choices, as shown by their type of loan applied for, I believe the actual answer is much simpler and less funny. In total, these four loan categories account for just over 2.5% of the dataset; most of which is the personal loans. With more examples, I would expect to see these plots exhibit regression to the mean as with the other plots.


I was curious to see if, on average, the original loan amount increases when stated monthly income increases. The thought is that those with more money either need more money to maintain a lifestyle or are eligible to receive (and more importantly pay back) more money. Faceting by loan type seems to have helped highlight this as the biggest differences in loan amount between lower income earners and higher earners were the following loan categories: Debt Consolidation, Home Improvement, Business, Large Purchase, and Wedding. With all of these categories, loans are likely higher among top income earners in order to maintain a lifestyle (Debt Consolidation, Home Improvement, Large Purchase, Wedding) or because the borrower is more likely to produce a return on investment (Debt Consolidation, Home Improvement, Business).


Other than loans that did not report a category, Student Loans were the only group where the borrower’s debt to income ratio went up as the borrowed loan amount went up.


As loan type would not affect the relationship between two credit variables that were recorded before the loan was processed, each of these looks very much the same.


The number of delinquencies in a borrower’s credit history seems to affect credit score in a similar fashion among a given score provider’s data, regardless of whether the borrower was a homeowner. It seems as though all debt is the same when it comes to this part of the credit score calculation.


Delinquency Summary of Borrowers with the Old Rating System :

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA’s
      0       0       0   5.677       6      99     919

Delinquency Summary of Borrowers with the New Rating System:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0       0       0   3.659       2      99

The new Prosper Rating System is much less forgiving on number of delinquencies in the borrower’s recent history. Not only are the medians equal or lower across the board, the third quartile numbers among the lower rated borrowers are much lower with the new system. One example that highlights this change: looking only at the comparison of these variables, “C” rated borrowers in the old system appear to share compositional characteristics with “E” rated borrowers in the new system. I found many articles suggesting that Prosper has struggled with repayment of its investors, so I wonder if the new rating system weighed delinquency more heavily to combat this.


Higher rates of delinquencies don’t necessarily mean fewer investors per loan for a given Prosper Credit Grade, but each subset does see greater variability in investors per loan as the borrower’s delinquencies go up.


Investor Summary:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      1       2      44   80.45     115    1189

Investor Summary of “ScoreX” Loans:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      1      24      62   96.94     135    1189

Investor Summary of “FICO08” Loans:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      1       1       1    31.8       7     714


Prosper Grade Linear Regression with “ScoreX” Loans:

                                               m10           m11           m12
(Intercept)                              22.120***     22.190***     22.521***
                                           (0.131)       (0.131)       (0.132)
log(avgCredit)                           -3.193***     -3.207***     -3.266***
                                           (0.020)       (0.020)       (0.020)
AvailableBankcardCredit                  -0.000***     -0.000***     -0.000***
                                           (0.000)       (0.000)       (0.000)
BankcardUtilization                       0.097***      0.094***      0.100***
                                           (0.005)       (0.005)       (0.005)
InquiriesLast6Months                      0.006***      0.006***      0.006***
                                           (0.001)       (0.001)       (0.001)
DelinquenciesLast7Years                   0.002***      0.002***      0.002***
                                           (0.000)       (0.000)       (0.000)
OpenCreditLines                              0.000     -0.002***       -0.001*
                                           (0.000)       (0.000)       (0.000)
as.numeric(FirstRecordedCreditLine)       0.000***      0.000***      0.000***
                                           (0.000)       (0.000)       (0.000)
OpenRevolvingMonthlyPayment               0.000***      0.000***      0.000***
                                           (0.000)       (0.000)       (0.000)
RevolvingCreditBalance                   -0.000***     -0.000***     -0.000***
                                           (0.000)       (0.000)       (0.000)
TotalCreditLinespast7years                              0.001***      0.002***
                                                         (0.000)       (0.000)
DebtToIncomeRatio                                                     0.026***
                                                                       (0.002)
R-squared                                    0.475         0.476         0.509
adj. R-squared                               0.475         0.476         0.509
sigma                                        0.390         0.390         0.378
F                                         7784.336      7027.846      6692.438
p                                            0.000         0.000         0.000
Log-likelihood                          -37014.548    -36956.756    -31588.568
Deviance                                 11793.671     11776.074     10119.826
AIC                                      74051.097     73937.511     63203.136
BIC                                      74152.922     74048.593     63322.337
N                                            77409         77409         70919

Old Prosper Rating Linear Regression:

                                                n1            n2            n3
(Intercept)                              18.421***     19.525***     19.560***
                                           (0.018)       (0.022)       (0.025)
avgCredit                                -0.022***     -0.024***     -0.024***
                                           (0.000)       (0.000)       (0.000)
AvailableBankcardCredit                                 0.000***      0.000***
                                                         (0.000)       (0.000)
BankcardUtilization                                                    -0.016*
                                                                       (0.006)
R-squared                                    0.958         0.967         0.967
adj. R-squared                               0.958         0.967         0.967
sigma                                        0.379         0.324         0.324
F                                       645122.395    311164.891    207589.217
p                                            0.000         0.000         0.000
Log-likelihood                          -12667.381     -6256.871     -6201.343
Deviance                                  4055.038      2248.860      2234.677
AIC                                      25340.762     12521.742     12412.687
BIC                                      25365.507     12553.628     12452.531
N                                            28233         21408         21350

As we would expect and have shown previously, the distribution of credit grades is somewhat normal, with “C” rated borrowers being the most frequent. Also, within each credit grade, borrowers with lower scores comprise a larger percentage of the group and those with higher scores comprise less. None of this is surprising given our previous findings, but it is interesting to see the composition visually.


Throughout the analysis, I became intrigued by the question of how investor behavior is affected by various factors, events, or variables. The plot above shows what proportion of total investors per month was allocated to each Prosper Credit Grade. In other words, which letter grades draw the attention (and funds) of the investors, and how has this changed over time? Surprisingly, we see major changes throughout the data, where loans become more or less popular. If I were to push this further, I might look at major changes that happened on the site to see if these coincide with the shifts we see. This would include changes to the rating systems or any legal action, which is what I suspect caused the huge gap around month 50.


Using a linear model to explain the variation in Prosper Credit Grade

I subset the data to include only loans with the newest rating system and credit scores from FICO08, so I can try to come up with a formula that would produce an estimate for Prosper Credit Grade during this era. Applying a log transformation to Prosper Credit Grade and avgCredit produced a plot that appeared to be the most linear, which will hopefully give us a decent base on which to create a linear model. The plot below will help determine other variables that will help refine the model.

                                                o5            o6            o7
(Intercept)                              33.222***     33.199***     32.977***
                                           (0.311)       (0.327)       (0.328)
log(avgCredit)                           -4.951***     -4.948***     -4.921***
                                           (0.047)       (0.050)       (0.050)
AvailableBankcardCredit                  -0.000***     -0.000***     -0.000***
                                           (0.000)       (0.000)       (0.000)
BankcardUtilization                       0.087***      0.087***      0.097***
                                           (0.009)       (0.009)       (0.009)
DebtToIncomeRatio                         1.086***      1.086***      1.089***
                                           (0.016)       (0.016)       (0.016)
InquiriesLast6Months                      0.088***      0.088***      0.088***
                                           (0.002)       (0.002)       (0.002)
DelinquenciesLast7Years                                    0.000         0.000
                                                         (0.000)       (0.000)
as.numeric(FirstRecordedCreditLine)                                   0.000***
                                                                       (0.000)
R-squared                                    0.514         0.514         0.514
adj. R-squared                               0.514         0.513         0.514
sigma                                        0.329         0.329         0.329
F                                         5671.893      4726.420      4063.001
p                                            0.000         0.000         0.000
Log-likelihood                           -8255.884     -8255.858     -8235.549
Deviance                                  2908.254      2908.249      2903.855
AIC                                      16525.769     16527.717     16489.098
BIC                                      16583.159     16593.305     16562.884
N                                            26864         26864         26864
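
A table of nested models like the one above is typically built by fitting successively larger formulas with `lm`. The sketch below runs on simulated data, so the coefficients are not comparable to the reported ones; the numeric grade column `gradeNum`, the simulated relationship and the model names are all assumptions for illustration.

```r
set.seed(42)
n <- 200

# Simulated borrowers (not the Prosper data).
toy <- data.frame(
  avgCredit = runif(n, 650, 850),
  DebtToIncomeRatio = runif(n, 0, 0.6)
)

# Invented relationship: grade number falls with credit score and rises
# with debt to income ratio, plus noise. Kept continuous for simplicity.
toy$gradeNum <- exp(30 - 4.5 * log(toy$avgCredit) +
                      1.0 * toy$DebtToIncomeRatio + rnorm(n, sd = 0.1))

# Nested linear models on the log scale, adding one predictor at a time.
fit_small <- lm(log(gradeNum) ~ log(avgCredit), data = toy)
fit_full <- lm(log(gradeNum) ~ log(avgCredit) + DebtToIncomeRatio, data = toy)

# Compare explained variance across the two models.
r2_small <- summary(fit_small)$r.squared
r2_full <- summary(fit_full)$r.squared
print(round(c(small = r2_small, full = r2_full), 3))
print(round(coef(fit_full), 3))
```

Adding a predictor to a least-squares model fitted on the same rows can never lower R-squared, so the adjusted R-squared, AIC and BIC rows of the table are the better guide to whether an extra variable is worth keeping.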

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

During the span of time covered by the dataset, Prosper changed how its proprietary grading system was calculated as well as its provider of credit scores. After both changes, large changes in investor behavior could be observed. The changes to its grading system and credit score provider likely coincided with or even inspired a boom in investor confidence. While the number of individual investments per day went up significantly after the change in providers, from 2879 per day to 4972, the median investors per loan went from 62 during the ScoreX era to 1 after FICO08 was introduced. Also, with the new Prosper rating system, interest rates per credit grade appear to vary much less than with the old system, indicating that common rates are likely assigned to particular grades and/or the requirements for each grade had been strengthened. Blog posts around this time indicate that investors with large amounts of capital were flooding to the site and fully funding loans more often than before, which is consistent with what we see in the data.

While the lowest interest rate loans were the most likely to have a high investor count and investors decline as interest rates go up, almost all loan types have increased investors at the higher end of their possible interest rates. I suspect that as interest rates go up, the borrowers generally present more risk to the investors, leading to fewer investors being interested in funding the loans. However, at the higher end of the spectrum, investors likely can’t resist the potential return from the favorable rates.

In all instances except the lowest credit score range, the new credit score source meant lower mean interest rates and less variability within each score group. Given we know that the scores from each of these systems are calculated very differently and have a different range of possibilities, I believe that the variance we see is due to different populations occupying the score ranges, rather than the new system being used by Prosper to issue lower interest rates for the same credit score range.

While loans with borrowers that have higher Debt to Income Ratios generally garner fewer investors per loan, oddly, Non-Debt Consolidation loans do not suffer this decline as drastically. As we have seen, those borrowers securing debt consolidation loans have a higher debt to income ratio at the time of application. One would assume that a high debt to income ratio would not affect investor behavior on these loans as much as with other loans, yet it does. One possible reason for this decline is that investors may actually feel more confident in the Debt Consolidation Loans in general and thus are comfortable investing a higher percentage of the original loan amount. Other than loans that did not report a category, Student Loans were the only group where the borrower’s debt to income ratio went up as the borrowed loan amount went up.

When comparing monthly payment to original loan amount, we see a fairly linear relationship, with the majority of the variation due to APR. When split by term, we see that loans with a 12 month term had the greatest variance in APR, but the 36 month loans had the highest mean and the most extreme outliers, which are also indications of variability. This is verified by the plots.

Were there any interesting or surprising interactions between features?

It was fascinating to see the comparison between debt to income ratio and average credit score. Among most of the Prosper Grade groups, a higher debt to income ratio actually led to a higher average credit score, which is counterintuitive. The credit score provider and Prosper Rating System also had an effect on the relationship, with each combination producing very different results. The highest Prosper graded borrowers were the only group to consistently see a decline in credit score with increased debt to income ratio, which raises the question: Is a little bit of debt actually good for your credit?

The relationship between Open Credit Lines and Debt to Income was much less linear or dramatic than I expected as well. In fact, homeowners in the dataset with 40 or more open lines of credit actually have a lower debt to income ratio on average than those with only 35 open lines of credit. Non-Homeowners with more than 35 open lines of credit generally have a higher debt to income ratio than homeowners, but there is not a significant amount of difference between the two subsets among those with fewer than 35 open lines of credit.

Next, when comparing Delinquencies to Credit Score, it was interesting to see some borrowers that were assigned grades of ‘AA’ or ‘A’ that had much lower credit scores or higher numbers of delinquencies during the last 7 years, yet retained their relatively high grade. Similarly, there were some borrowers with the low grade of ‘E’, yet credit scores of nearly 750 and very few delinquencies. This was much less common after the switch to FICO08 credit scores, however. With the switch to the new rating system and credit score provider, it would appear that standards for each grade were strengthened. One good example of this is seen when looking at delinquencies per Prosper Grade and realizing that borrowers with higher delinquencies were much less likely to be approved for loans on Prosper after the switch.

Finally, it was interesting to see that regardless of the length of the term, those borrowers of loans from $3500-6000 generally received the highest APR. Bigger and smaller loans generally lead to lower APRs, whereas I would have expected this to be more linear.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

After noticing a seemingly linear relationship between credit scores and Prosper credit grades under the old rating system, I created a linear model using each of these variables and two related to bank card utilization. I found that credit score and bank card utilization accounted for 96.7% of the variance in Prosper Credit Grades.

However, this felt like cheating as the old rating system was extremely simplified, so I set out to create a model that would predict the Prosper credit grade with the new rating system as well as the new credit score provider. Unfortunately, even with the inclusion of a number of personal credit variables, the model was only able to account for approximately 51.4% of the variance in the grade. I suspect that the new rating system might have taken into account past activity on Prosper, the variables for which I did not include in my subset of the data.


Final Plots and Summary


Plot One

Description One

While every borrower in the dataset was assigned a Credit Grade by Prosper, and nearly every loan listed the borrower’s credit score range, fundamental changes to how each was derived occurred during the time span covered by the dataset. Understanding these differences informs all future analysis of this data with respect to each variable.

First, the new Prosper Rating System is much more nuanced than the old system and less linear too. With a Pearson product-moment correlation coefficient of nearly 0.98, the relationship between Prosper Credit Grade and Average Credit Score using the old Prosper Rating System is almost perfectly linear. The same relationship under the new rating system produces a correlation coefficient of approximately 0.55, suggesting that the new rating system used much more than credit score to calculate each grade.

Second, the credit scores provided by FICO08 have a different range and distribution than those provided by ScoreX. Though the scores from each provider have an identical median of 689.5 and a similar mean (694.4 for ScoreX and 700 for FICO08), the range of ScoreX scores from 9.5 to 889.5 was vastly different than the range of FICO08 scores (649.5 to 849.5). Furthermore, visual inspection of the distributions reveals how truly different the subsets are.


Plot Two

Description Two

More than any other variable, Prosper appears to assign interest rates to loans based on the borrower’s Credit Grade. The Credit Grade itself is calculated from a number of these variables, but none are individually correlated to the interest rate more than the Credit Grade. For example, approximately 20% of the variation in Borrower Rate is explained by a loan’s term and 33% is explained by a loan’s amount, but nearly 88% of Borrower Rate variation is explained by the Credit Grade.
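
One way to obtain "percent of variation explained" figures like these is to fit a separate single-predictor linear model for each variable and read off its R-squared. The sketch below does this on invented data; the column `ProsperGrade` and the helper `r2_for` are hypothetical names, and the source does not say exactly how its percentages were derived.

```r
# Toy loans (values are invented for illustration).
toy <- data.frame(
  BorrowerRate = c(0.06, 0.08, 0.14, 0.16, 0.28, 0.30),
  ProsperGrade = factor(c("AA", "AA", "C", "C", "HR", "HR")),
  Term = c(36, 60, 36, 60, 36, 60)
)

# R-squared of a model with BorrowerRate as response and one predictor.
r2_for <- function(predictor, data) {
  fit <- lm(reformulate(predictor, response = "BorrowerRate"), data = data)
  summary(fit)$r.squared
}

# Compare how much of the rate variation each predictor explains on its own.
r2_by_predictor <- sapply(c("Term", "ProsperGrade"), r2_for, data = toy)
print(round(r2_by_predictor, 3))
```

Treating the grade as a factor lets each grade have its own mean rate, which is why it can capture far more of the variation than a numeric predictor such as Term in this toy example.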


Plot Three

Description Three

Ultimately, the mission of Prosper is to unite borrowers with investors, so it is interesting to investigate what types of borrowers investors are attracted to over time. The plot above shows total investments per quarter over the length of the dataset, along with indications of when the Prosper Grading System changed as well as when the Credit Score Provider changed. I removed the first few months of data, as they were heavily skewed toward higher rated borrowers. Given the remaining data, it is safe to assume this is an anomaly due to the uncertainty of a new service. After the Grading System change, investors began investing more heavily in the lower rated borrowers, each of which saw a decline beginning around 2012. Looking at the back end of the timeline, we see that borrowers rated A, B and C saw a large uptick in investments relative to their positions 1-3 years prior.


Reflection

With well over 100,000 data points, it can be easy for nuance to be lost among a sea of averages. However, after breaking down the dataset and highlighting key variables, I have been able to tell stories with once muddy information. I have learned that rating credit-worthiness is an incredibly complex and proprietary process and that a few powerful people or algorithms are capable of having a massive impact on the borrower. Look for instance at how differently credit scores are calculated and apportioned or how the change in Prosper’s grading system impacted how investors analyzed and judged borrowers.

A number of things surprised me in the dataset, such as how many borrowers with a high number of delinquencies in their past were given loans or how many borrowers were willing to take loans with interest rates higher than many available credit cards. But I was also able to confirm a number of my theories and expectations. For instance, credit score and credit grade are both highly correlated with loan interest rates. Given that the website assigns rates based on these variables, this comes as no surprise.

An additional theory that I sought to confirm was that when it comes to credit rating and credit scores, a little debt utilization is actually a good thing. I was pleased to find that, based on a number of metrics, those around the 25th percentile of available credit utilization were generally associated with higher credit scores than those with higher or lower utilization.

One major limitation I see with this dataset is that it spans one of the most tumultuous economic time periods in the history of the United States. Time-based conclusions drawn from this dataset should consider what impact economic conditions had on the observed variables.

Finally, while I would have loved to come up with a golden model to perfectly predict a borrower’s credit score, I found that it is far more complex than I originally expected. One can be sure that each company’s model is very proprietary, focusing on factors that they specifically feel are more important. Given more time, broader data and a greater number of variables, I would be interested in “reverse engineering” a credit score, to find out how to boost my own score!